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We describe a massively parallel implementation of the recently developed discontinuous Galerkin 
density functional theory (DGDFT) method, for efficient large-scale Kohn-Sham DFT based elec¬ 
tronic structure calculations. The DGDFT method uses adaptive local basis (ALB) functions gen¬ 
erated on-the-fly during the self-consistent field (SGF) iteration to represent the solution to the 
Kohn-Sham equations. The use of the ALB set provides a systematic way to improve the accuracy 
of the approximation. It minimizes the number of degrees of freedom required to represent the so¬ 
lution to the Kohn-Sham problem for a desired level of accuracy. In particular, DGDFT can reach 
the planewave accuracy with far fewer numbers of degrees of freedom. By using the pole expansion 
and selected inversion (PEXSI) technique to compute electron density, energy and atomic forces, we 
can make the computational complexity of DGDFT scale at most quadratically with respect to the 
number of electrons for both insulating and metallic systems. We show that DGDFT can achieve 
80% parallel efficiency on 128,000 high performance computing cores when it is used to study the 
electronic structure of two-dimensional (2D) phosphorene systems with 3,500-14,000 atoms. This 
high parallel efficiency results from a two-level parallelization scheme that we will describe in detail. 


I. INTRODUCTION 

Kohn-Sham density functional theory (DFT) [2 [5] is 
the most widely used methodology for performing ab ini¬ 
tio electronic structure calculations to study the struc¬ 
tural and electronic properties of molecules, solids and 
nanomaterials. However, until recently, routine DFT cal¬ 
culations are only limited to small systems because they 
have a relatively high complexity with re¬ 

spect to the system size N. As the system size increases, 
the cost of traditional DFT calculations becomes pro¬ 
hibitively expensive. Therefore, it is still challenging to 
routinely use DFT calculations to treat large-scale sys¬ 
tems that may contain thousands to tens of thousands of 
atoms. Although various linear scaling 0{N) methods[31- 

have been proposed for improving the efficiency of DFT 
calculations, they rely on the nearsightedness principle, 
which leads to exponentially localized density matrices 
in real-space for systems with a finite energy gap or at 
finite temperature. Furthermore, most of the existing lin¬ 
ear scaling DFT codes, such as SIESTA, [6] OPENMX,[7] 
HONPAS,[5] CP2K[S] and CONQUEST,[in] are based 
on the contracted and localized basis sets in the real- 
space, such as Gaussian-type orbitals or numerical atomic 
orbitals. [4] The accuracy of methods based on such con¬ 
tracted basis functions are relatively difficult to improve 
systematically compared to conventional uniform basis 
sets, for example, plane wave basis set, m while the dis¬ 
advantage of using uniform basis sets is the relatively 
large number of basis functions per atom. 
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Another practical challenge of DFT calculations is re¬ 
lated to the implementation of various approaches to 
take full advantage of the massive parallelism available 
on modern high performance computing (HPC) architec¬ 
tures. Conventional DFT codes, especially, plane wave 
codes, such as VASP,[T2] QUANTUM ESPRESSO [l3] 
and ABINIT,[T3] have relatively limited parallel scala¬ 
bility even for large-scale systems involving thousands of 
atoms. The strong scaling performance (i.e., the change 
of computational speed with respect to the number of 
processing units for a problem of fixed size) of these 
planewave based codes is often poor when the num¬ 
ber of computational cores exceeds a few thousands. 
Nonetheless, improvements have been made in the past 
decade. For instance, Qbox[Tn] demonstrated high seal- 
ability to over 1,000 atoms using 131,072 cores on the 
IBM Blue Gene/L architecture (of which a factor of 8 
is due to fc-point parallelization). On the other hand, 
several DFT software packages based on contracted ba¬ 
sis functions have achieved high parallel performance us¬ 
ing linear scaling techniques for insulating systems. Ex¬ 
amples include GP2K[T5] and CONQUEST, [IT] in which 
linear scaling is achieved based on parallel sparse ma¬ 
trix multiplication. [T^-[2ni CP2K has demonstrated cal¬ 
culations on 96,000 water molecular scaling to 46,656 
cores. [TB] CONQUEST has demonstrated scaling to over 
2,000,000 atoms on 4,096 cores. [TT] 

The recently developed discontinuous Galerkin den¬ 
sity functional theory (DGDFT) aims at reducing 
the number of basis functions per atom while maintain¬ 
ing accuracy comparable to that of the planewave basis 
set. This is achieved by using a set of adaptive local ba¬ 
sis (ALB) functions, which are generated on-the-fly dur¬ 
ing the self-consistent field (SGF) iteration. The use of 
adaptively generated basis functions is also explored in 
other software packages such as ONETEP[15] and re- 
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cently BigDFT.[23] One unique feature of the ALB set 
is that each ALB function is strictly localized in a cer¬ 
tain element in the real space, and is discontinuous from 
the point of view of the global domain. The continu¬ 
ous Kohn-Sham orbitals and density are assembled from 
the discontinuous basis functions using the discontinuous 
Galerkin (DG) method. |241 E5] The ALB set takes both 
atomic and chemical environmental effects into account, 
and can be systematically improved just by increasing 
one number (number of ALBs per element). Numerical 
results suggest that it can achieve the same level of ac¬ 
curacy obtained by conventional plane wave calculations 
with much fewer number of basis functions. The solution 
produced by DGDFT is also fully consistent with the so¬ 
lution of standard Kohn-Sham equations in the limit of 
a complete basis set, and the error can be measured by 
a posteriori error estimators. |26j 

The strict spatial locality guarantees that the ALBs 
can form an orthonormal basis set and the overlap ma¬ 
trix is therefore an identity matrix, which requires only 
the solution of a standard eigenvalue problem rather than 
a generalized eigenvalue problem. The discrete Kohn- 
Sham Hamiltonian matrix is sparse. Furthermore, the 
sparsity pattern bears a resemblance to a block version 
of finite difference stencils, and facilitates parallel imple¬ 
mentation. The sparse discrete Hamiltonian matrix al¬ 
lows DGDFT to take advantage of the pole expansion and 
selected inversion (PEXSI) technique. [27H29] The PEXSI 
method overcomes the 0{N^) scaling limit for solving 
Kohn-Sham DFT, and scales at most as 0{N‘^) even for 
metallic systems at room temperature. In particular, the 
computational complexity of the PEXSI method is only 
0{N) for quasi ID systems, and is for quasi 

2D systems. This also makes the DGDFT methodology 
particularly suitable for analyzing low-dimensional (quasi 
ID and 2D) systems regardless whether the system is a 
metal, a semiconductor or an insulator. [28] different from 
the near-sightedness assumption in standard linear scal¬ 
ing DFT calculations. 

In this paper, we describe a massively parallel DGDFT 
method, based on a two-level parallelization strategy. 
This strategy results directly from the domain decom¬ 
position nature of the DGDFT method in which the 
global computational domain is partitioned into a num¬ 
ber of subdomains from which the ALBs are generated. 
We demonstrate the accuracy and high parallel efficiency 
of DGDFT by using two-dimensional (2D) phosphorene 
monolayers with 3,500-14,000 atoms as examples to show 
that DGDFT can take full advantage of up to 128,000 
cores on a high performance computer. 

The rest of paper is organized as follows. After brief 
introduction of the DGDFT method, we describe sev¬ 
eral important implementation considerations of the mas¬ 
sively parallel DGDFT method in section [H] We demon¬ 
strate the numerical performance of DGDFT for 2D 
phosphorene monolayers in section m followed by the 
conclusion in section IIVI 


II. DGDFT METHODOLOGY AND PARALLEL 
SCHEME 

A. Adaptive local basis set in a discontinuous 
Galerkin framework 

Since the main focus of this paper is to describe a 
highly efficient parallel implementation of the DGDFT 
methodology, we will not provide detailed theoretical 
derivation of the DGDFT method that can be found in 
Ref. [21] . We will briefly introduce the basic work flow of 
DGDFT, and the main steps and major computational 
bottlenecks that require parallelization. For detailed ex¬ 
planation of technical details, we refer readers to Ref. [2T| 
and a forthcoming publication |30j concerning the total 
energy and the force calculation in DGDFT. Throughout 
the paper we consider F-point calculation only. There¬ 
fore, the Kohn-Sham orbitals are real. We assume the 
system is closed shell. Spin degeneracy is omitted to 
simplify the notation in this section, but is correctly ac¬ 
counted for in the DGDFT code and the numerical sim¬ 
ulation. 

The goal of the DGDFT method is to find the min- 
imizer of the Kohn-Sham energy functional, which is a 
nonlinear functional in terms of N single particle wave- 
functions (orbitals) ipi. The minimizer of this functional 
is sought in a self-consistent field (SCF) iteration which 
produces electron density p that satisfies a fixed point 
(or self-consistent) nonlinear map. In a standard SCF 
iteration, a linear eigenvalue problem needs to be solved 
in each SCF cycle. As discussed in Ref. jH], the solution 
of this linear eigenvalue problem is usually the computa¬ 
tional bottleneck for solving Kohn-Sham DFT. 

In the DGDFT method, we partition the global com¬ 
putational domain D into a number of subdomains (called 
elements), denoted by Ei,, Em- An example of par¬ 
titioning the global domain of a model problem into a 
number of 2D elements is given in Fig. We use T to 
denote the collection of all elements. In the current ver¬ 
sion of DGDFT, we use periodic boundary conditions to 
treat both molecule and solids. However, the method can 
be relatively easily generalized to other boundary condi¬ 
tions such as Dirichlet boundary condition. Therefore 
each surface of the element must be shared between two 
neighboring elements, and S denotes the collection of all 
the surfaces. 

Each Kohn-Sham orbital is expanded into a linear com¬ 
bination of adaptive local basis functions (ALBs) ifiKj, 
i.e., 

M J K 

( 1 ) 

K=1 i=l 

where ipKjix) is the j-th ALB in the element Ex that 
has nonzero support only in Ex, and Jx is the total num¬ 
ber of ALBs in Ex- The basis function ^px,] is not nec¬ 
essarily continuous across the boundary of Ex- Because 
each px,j is assumed to be square integrable within Ex 
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and is zero outside of EK^ its inner product with other 
function can be defined in terms of an Lp' inner product 
defined on E^. As a result, a natural global inner prod¬ 
uct between quantities such as ipt, which we denote by 
(•, •)^, can be taken as the sum of local inner products 
dehned for all elements. Similarly, we can also define a 
global surface inner product, denoted by as the 

sum of local surface inner products defined on all sur¬ 
faces of all elements. The ALB set is orthonormal, i.e., 
fOT K,K' = j = / = l,...,J^cq 

= SK,K'5j,j'- ( 2 ) 

Because the expansion of ipi contains at least two ALBs 
localized in two adjacent elements with a shared surface 
on which neither ALB is continuous, the values of ipi on 
two sides of the surface can be different. This difference 
calls for the notion of average of gradient and 

the concept of jump of function value [[V'i]] defined on 
the surface. The use of average and jump operators dis¬ 
tinguishes DGDFT from other KS-DFT solvers. 

Using these notations, the total energy functional to 
be minimized with respect to the N Kohn-Sham orbitals 
{'4’i} subject to orthonormality condition can be written 
as 

1 ^ 

-UDG({'0i}) {^eS,p)r 

Na Li N 

+ 

1^1 

N 

W])5 

N 

+ W])^, 

( 3 ) 

where we use Ves to denote the effective one-body poten¬ 
tial (including local pseudopotential Uioc, Hartree poten¬ 
tial Vh and the exchange-correlation potential Vy^c[p]), 
the terms that contain bji and correspond to 

the nonlocal pseudopotential in the Kleinman-Bylander 
form, m and a is an adjustable penalty parameter to en¬ 
sure that Eq. ^ has a well dehned ground state energy. 
For each atom I, there are Lj functions {bjj}, called 
projection vectors of the nonlocal pseudopotential. The 
parameters { 7 / 7 } are real valued scalars. We refer the 
readers to Ref. [21] for more detailed explanation of the 
notation and theory. 

The procedure for constructing the ALBs and a de¬ 
tailed example of the ALBs will be given later in this 
subsection. For now we just treat Eq. Q as an ansatz 
for representing the Kohn-Sham orbitals. Minimizing the 
coefficients {Ci-Kj} subject to orthonormality condition 
gives rise to the Euler-Lagrange equation of Eq. ([^ which 


is a linear eigenvalue problem 

= PiCi-K’,j'- ( 4 ) 

KJ 

Here is the discrete Kohn-Sham Hamiltonian ma¬ 
trix, with matrix entries given by 

TirDG 

= (2 + iV’Kj' ,yeSPK,j)j-^5K,K' 

(5) 

+ (“2 

The E[^^ matrix can be naturally partitioned into ma¬ 
trix blocks as sketched in Fig. We call the submatrix 
H^9..k- of size Jk' x Jk the (K^jATj-th matrix block 
of or for short. The terms are grouped 

by three parentheses on the right hand side of Eq. ([^ 
to reflect different contributions to the DG Hamiltonian 
matrix, and will be treated differently in our parallel im¬ 
plementation of the method. The first group originates 
from the kinetic energy and the local pseudopotential, 
and only contributes to the diagonal blocks The 

second group comes from nonlocal pseudopotentials, and 
contributes to both the diagonal and off-diagonal blocks 
of H^'^. Since a projection vector of the nonlocal pseu¬ 
dopotential is spatially localized, we require the dimen¬ 
sion of every element along each direction (usually on 
the order of 6 ^ 8 Bohr) to be larger than the size of 
the nonzero support of each projection vector (usually 
on the order of 2 ^ 4 Bohr). Thus, the nonzero sup¬ 
port of each projection vector can overlap with at most 
2'^ elements as shown in Fig. (b) {d = 1, 2,3). As a re¬ 
sult, each nonlocal pseudopotential term may contribute 
both to the diagonal and the off-diagonal blocks of 
The third group consists of contributions from boundary 
integrals, and can also contribute to both the diagonal 
and off-diagonal blocks of Each boundary term 

involves only two neighboring elements by definition as 
plotted in Fig.[^(a). In summary, is a sparse matrix 
and the nonzero matrix blocks correspond to interactions 
between neighboring elements (Fig. [^. 

After constructing the matrix and solving the 

eigenvalue problem Q, the electron density required to 
evalute the effective potential Ueff in Eq. @ can be ob¬ 
tained from 

N 

pD = '^\9iD\^- ( 6 ) 

2=1 

It is well known that (§ is not the only way to com¬ 
pute the electron density. An alternative approach which 
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FIG. 1: (Color online) A model system in 2D partitioned into 
16 (4 X 4) equal sized elements, (a) An extended element Qe 
associated with the element Eq, and Qe includes elements Ei, 
E 2 , Ei, Ei, Ee, Er Eg, Eio and En. The four surfaces of Eg 
on which boundary integrals are computed are highlighted 
by arrows, (b) Two possible cases of projection vectors of 
the nonlocal pseudopotential, one contained a element, and 
another contained in multiple (here 4) elements, (c) The block 
structure of DG Hamiltonian matrix with blocks with nonzero 
values highlighted. 


does not require computing eigenvalues and eigenvec¬ 
tors of Hog is the pole expansion and selected inversion 
(PEXSI) method To make use of the PEXSI 

method, we need to express p{x) in terms of selected 
blocks of the density matrix represented in the ALB set 
(or density matrix for short in the following discussion). 
To be more precise, this density matrix has the form 

N 

PK,j-K',j' = (7) 

i=l 

and can be accurately approximated as a matrix function 


of Hog without knowing Ci-K,j explicitly. 

Using the density matrix, we can express the electron 
density as 

p{x) 

M Jk M Jk' / N 

= H VK,j{x)<pK',r(.x) 

K^l j^l j'^1 \i^l 

M Jk Jk 

^ X X X ^K,j{x)ipK,j'{x)PK,r,K,j'- 

K=1 j=l j' = l 

( 8 ) 

Here we have used the fact that each function ipK,j (x) is 
strictly localized in the element Hk to eliminate the cross 
terms involving both K and K'. As a result, the selected 
blocks, or more specifically, the diagonal blocks of the 
density matrix Pk-.k = Pk,:-,k,: are needed to evaluate 
the electron density. This is a key feature of this expres¬ 
sion that makes it possible to use the PEXSI method as 
will be discussed in section |II C| After self-consistency 
of the electron density is achieved in the SCF iteration, 
the total energy and atomic forces can also be evaluated 
using the PEXSI method. 

We have demonstrated that high accuracy in the to¬ 
tal energy and atomic forces can be achieved with a 
very small number (4—40) of basis functions per atom 
in DGDFT, compared to fully converged planewave 
calculations. EH [50] Besides selected blocks of the density 
matrix (DM), the Helmholtz free energy and the atomic 
force can be evaluated by computing selected blocks of 
the free energy density matrix (EDM) and the energy 
density matrix (EDM), respectively. The technique for 
computing EDM and EDM is very similar to that for 
computing the DM. [28] 

One notable feature of the ALB set is that they are 
generated on the fly, and is adaptive not only to the 
atomic but also the environmental information. In order 
to construct ALBs, we introduce, for each element Ek, 
an extended element Qk that contains Ex and a buffer 
region surrounding Ex- We define = Ves\Q^ to 

be the restriction of the effective potential at the current 
SCF step to Qx-, and to be the restriction 

of the nonlocal potential to Qx ■ These restricted poten¬ 
tials define a local Kohn-Sham linear eigenvalue problem 
on each extended element Qx- 

/ px,j{x)px, 3 '{x) dx = 6jj'- 

J Qk 

(9) 

This linear eigenvalue problem can be discretized by us¬ 
ing traditional basis sets such as plane waves and solved 
by an iterative method such as the locally optimal block 
preconditioned conjugate gradient (LOBPCG) method. 
We remark that SCF iterations need not and should not 
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be performed within each element Qk- The lowest Jk 
eigenvalues {^K,j}j=i and the corresponding eigenfunc¬ 
tions {^K,j}j=i are computed on Qx- We then restrict 

from Qk to Ek- The truncated vectors are 
not necessarily orthonormal. Therefore, we orthonormal- 
ize the set of truncated eigenvectors to obtain {‘PK,j}j=i- 
We then set each ipxj to zero outside of Ek, so that it 
defined over the entire domain, but is in general discon¬ 
tinuous across the boundary of Ek- These functions con¬ 
stitute the ALB set that we use to represent the Kohn- 
Sham Hamiltonian. Because they satisfy the orthonor¬ 
mality condition Q, the overlap matrix corresponding 
to the ALB set is an identity matrix. Hence, no general¬ 
ized eigenvalue problem with a potentially ill-conditioned 
overlap matrix need to be solved. For more details of the 
construction of ALBs, such as the restriction of the po¬ 
tential and the choice of boundary conditions for the local 
eigenvalue problem ([^, we refer readers to Ref. m- 

As an example. Fig. shows the ALBs of a 2D phos- 
phorene monolayer with 140 phosphorus atoms (Pi 4 o)- 
This is a two-dimensional system, and the global com¬ 
putational domain is partitioned into 64 equal sized el¬ 
ements along the Y and Z directions, respectively. For 
instance, the extended element Qio associated with the 
element Eiq contains elements Ei, E 2 , A 3 , Eg, An, A 17 , 
Aig and A 19 . We show the isosurfaces of the first three 
ALB functions for this element in Fig. [^a)-(c). Each 
ALB function shown is strictly localized inside Aio and 
is therefore discontinuous across the boundary of ele¬ 
ments. On the other hand, each ALB function is delocal¬ 
ized across a few atoms inside the element since they are 
obtained from eigenfunctions of local Kohn-Sham Hamil¬ 
tonian. Although the basis functions are discontinuous, 
the electron density is well-defined and is very close to be 
a continuous function in the global domain (Fig.j^d)). 

To summarize, the flowchart of the DGDFT method 
for solving Kohn-Sham DFT is shown in Fig. 


B. Two level parallelization strategy 

The DGDFT framework naturally lends itself to a 
two level parallelization strategy. At the coarse grained 
level, we distribute work among different processors by 
elements. We call this level of concurrency the inter¬ 
element parallelization. Within each element, the eigen¬ 
value solvers on each local (extended) domain and the 
construction of the DG Hamiltonian matrix can be fur¬ 
ther parallelized. This level of parallelization is called 
the intra-element parallelization. We use the Message 
Passing Interface (MPI) to handle data communication. 

We assume that the total number of processors is 
Ptot = M X Pe, where M is the number of elements, and 
Pe is the number of processors assigned to each element. 
We partition these processors into a 2D logical processor 
grid following a column major order as shown in Fig. 
We call this 2D processor grid the global processor grid 


£^10 



r 












K 


1 








1 ^ 


Sx( 






I X 








1 ^ 






) 


! ^ 










>« 

tot/ 







(a) 




FIG. 2: (Color online) The isosurfaces (0.04 Hartree/Bohr^) 
of the first three ALB functions belonging to the tenth element 
(Aio), (a) (fii, (b) 4)2, (c) (/!> 3 , and (d) the electron density 
p across in top and side views in the global domain in the 
example of P140. The red and blue regions indicate positive 
and negative isosurfaces, respectively. There are 64 elements 
and 80 ALB functions in each element in the P140 system. 


to distinguish it from other processor grids employed in 
various parts of the massively parallel DGDFT method. 
Each row (column) communicator of this grid is called a 
global row (column) communicator. The processors with 
ranks (K — l)Pe -b 1 to KPg {K = 1,..., M) are in the 
A-th global column processor group, and are assigned to 
element Ek for intra-element parallelization. Similarly, 
the processors with ranks i, Pe -\- i, ■ ■ ■, {M — l)Pe -b i 
{i = 1,..., Pe) are in the i-th global row processor group. 
When a very large number of processors are used, it is im¬ 
portant to avoid global all-to-all communication as much 
as possible in order to reduce communication cost and 
achieve parallel scalability. Therefore by dividing pro¬ 
cessors into column and row processor groups, we can 
restrict most of the communication within a global col¬ 
umn or row group. 

Based on the intra-element parallelization strategy to 
be detailed below, the maximum number of processors 
that can be effectively used for a single element depends 
on the number of basis functions to be generated on a sin¬ 
gle element, which is usually on the order of 100. It can 
be as large as a few hundreds. The level of concurrency 
that can be achieved in the inter-element parallelization 
is determined by the number of elements. The maximum 
number of processors that can be employed by DGDFT 
is therefore determined by the number of basis functions 
per element multiplied by the number of elements, which 
is equal to the dimension of the DG Hamiltonian matrix. 
For example, the P 140 system contains 140 phosphorus 
atoms and is partitioned into 64 (8 x 8 ) elements. There 
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FIG. 3: (Color online) Flowchart of the DGDFT method. 
There are three time consuming steps: the generation of ALB 
functions, the construction of DG Hamiltonian matrix, and 
the evaluation of the approximate charge density by either 
diagonalizing the DG Hamiltonian (DIAG) or by using the 
PEXSI method. H^fi and represent the global Kohn- 

Sham Hamiltonian operator for a given electron density, and 
the local Kohn-Sham Hamiltonian on the extended element 
Qk, respectively. 


are about 2 atoms in each element and we use 50 basis 
per element. The maximum number of processors that 
can be effectively used for this system is 3,600. For the 
2D phosphorene test problems (P 3500 and P 14000 ) used 
in this work, we partition the phosphorene systems into 
1,600 (40 X 40) and 6,400 (80 x 80) elements respectively. 
The maximum number of processors that can be effec¬ 
tively used for these systems are 80,000 and 320,000, re¬ 
spectively. Currently, we are often limited by the number 
of processors available on the existing high performance 
computers such as the Edison system at NERSC. The 
maximum number of processors we can use is 128,000. 
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FIG. 4: (Color online) Global processor grid for implementing 
the two level parallelization in the DGDFT method. M is 
the number of elements in the systems. R, is the number of 
processors given for each element. 


Generation of the ALB functions 

For each extended element, the computation of eigen¬ 
functions for the local Kohn-Sham Hamiltonian can be 
parallelized in a way similar to the parallelization of a 
standard planewave based Kohn-Sham DFT solver. The 
local Kohn-Sham orbitals form a local dense matrix ^^ 
of dimension Ng x Jk- Here Jk is the number of ALBs to 
be computed, and Ng is the total number of grid points 
required to represent each local Kohn-Sham orbital in the 
real space, which is determined from the kinetic energy 
cutoff by the following rule: 




TT 


( 10 ) 


Here Li is the length of the extended element along the i- 
th coordinate direction. The total number of grid points 
is A, = ni=i(-^s)i' ^ typical calculation Ng ~ 10® 

and Jk ~ 10^ as shown in Fig. 

On each extended element, we use the LOBPCG 
method [5^ to solve the local Kohn-Sham orbitals. The 
main operations involved in the LOBPCG solver are: 1) 
The application of the local Hamiltonian operator 
to '^K- 2) Dense matrix-matrix multiplication of the 

form $^ 41 , where matrices $, if have the same dimen¬ 
sion as the Kohn-Sham orbitals 4'^. 3) Dense matrix- 
matrix multiplication of the form $(7, where 4> has the 
same dimension as and C is a small matrix on the 
order of Jk x Jk- 4) The diagonalization of matrices of 
size 0[Jk) using the Rayleigh-Ritz procedure. 5) The 
application of a preconditioner operator. 




































7 


(a) 


(O 

O 


B) 


102 


Pi 


...p. 


Pi 


Pe 


(b) 


MPI Alltoallv 
< -=- *■ 


Pi 


Column cyclic partition 
For Hamiltonian operator 


Row block partition 
For matrix multiplication 


FIG. 5: (Color online)Two different types of partition of 'i’K- 
The partition shown in (a) is used for computing 
The partition shown in (b) is used when matrix-matrix mul¬ 
tiplications of the form and "^kC are performed. 


It should be noted that the these different types of 
operations require different data distribution and task 
parallelization strategies. Operation 1) requires applying 
the Laplacian operator via the Fast Fourier Transform 
(FFT), the local pseudopotential operator and the non¬ 
local pseudopotential operator to This can be done 
easily if the orbitals (i.e., columns of k) are partitioned 
along the column direction (see Fig. (a)). Since each 
processor holds all entries of an orbital, the FFT can 
be done in the same way as in a sequential implementa¬ 
tion. We further assume that each processor associated 
with the extended element Qx has a copy of the local 
pseudopotential , and the nonlocal pseudopotential 
Therefore all operations described in 1) can be 
performed in the same way as those in a sequential im¬ 
plementation. 

The most efficient way to parallelize operations 2) and 
3) is to partition by row blocks as shown in Fig. 
(b). This is because for operation 2), one can compute 
the matrix inner product of each block locally on each 
processor, and then use MPLAllreduce among the Pe 
processors to sum up local products of size Jx x Jx- 
For operation 3), there is no communication at all if all 
Pe processors have a local copy of the C matrix. Parti¬ 
tioning by columns would incur more communication 
cost and make this part of the computation less scalable. 
Since in each LOBPCG iteration we apply the Hamilto¬ 
nian operator to Ik/f once, but perform 12 matrix-matrix 
multiplication of operation type 2) and 7 matrix-matrix 
multiplication of operation type 3), it is worthwhile to 
switch back and forth between a column partition and 
row partition of Ikin between the first and other types 
of operations performed in each LOBPCG iteration. This 


can be achieved by using a MPLAlltoallv call. 

For operation 4), since Jx is usually on the order of 
hundreds in practice, we solve the Rayleigh-Ritz prob¬ 
lem and perform subspace diagonalization locally on each 
processor. Numerical experiments indicate that this se¬ 
quential part usually takes around or less than 1 second. 

Finally, in our implementation we use the precondi¬ 
tioner proposed in Ref. [33] to accelerate the convergence 
of LOBPCG. The preconditioner can be easily applied to 
different orbitals simultaneously without communication. 
Thus a column partition of x is suitable for applying 
the preconditioner in parallel. 

Once the Kohn-Sham orbitals x are constructed 
through LOBPGG, they are restricted from the extended 
element Qx to the element Ex- After orthonormaliz- 
ing columns of the restricted '^x, we obtain the ALBs 
denoted by = [ipx,iT ■ ■ i^k,Jk\- We remark that 
it is not necessary to compute the local wavefunctions 
'ki<- to full accuracy before the electron density becomes 
self consistent in the SCF iteration. As the accuracy of 
the electron density improves during the SCF cycle, a 
more accurate x can be obtained from running a few 
iterations of the LOBPCG procedure that uses the ^x 
returned from the previous SCF iteration as a starting 
guess. In practice, we find that using 3 preconditioned 
LOBPCG iterations per SCF iteration is often sufficient 
to achieve rapid convergence of the SCF procedure. Our 
numerical results indicate that the overall procedure for 
generating the ALBs is highly efficient. For instance, for 
the phosphorene P 3500 system, the total time for gener¬ 
ating the ALBs is only 1.35 and 0.71 sec by using 12,800 
and 25,600 processors, respectively. 

Construction of the DG Hamiltonian 

Due to the spatial locality of the ALBs, the DG Hamil¬ 
tonian matrix in Eq. ([^ is a sparse matrix and has a 
block structure that can be naturally distributed among 
different column processor groups assigned to different 
elements as shown in Fig. (c), i.e. the processors as¬ 
signed to the element Ex assembles the AT-th block row 
of E[^^. The construction of the DG Hamiltonian matrix 
consists of the evaluation of volume integrals within each 
element and surface integrals at the boundary of differ¬ 
ent elements. To achieve high accuracy, all integrals in 
([^ are evaluated by a Gaussian quadrature defined on 
a Legendre-Gauss-Labotto (LGL) grid in each element. 
The gradients of ALBs sampled on the LGL grid points 
and denoted by can be evaluated by applying dif¬ 

ferentiation matrices to ^x along the x,y,z directions, 
respectively. To compute efficiently, both ^x and 
should be partitioned and distributed by columns 
among processors assigned to Ex- We refer readers to 
e.g. Ref. [31] for details of constructing differentiation 
matrices associated with Gaussian quadratures. 

The construction of the K-th block row of con¬ 
sists of the following three steps that correspond to the 
three terms grouped by parentheses on the right hand 
side of Eq. (§. 1) Compute contributions to the diagonal 
matrix blocks from the kinetic energy and Ves terms. 2) 

















Compute contributions to the diagonal and off-diagonal 
matrix blocks from the nonlocal pseudopotential term. 3) 
Compute contributions to the diagonal and off-diagonal 
matrix blocks from the boundary integral terms. 

In step 1), it is generally more efficient to partition 
columns of among different processors as shown in 
Fig. I^a) because applying a differentiation matrix, or 
local pseudopotential I4ff to different columns of re¬ 
quires no communication. The inner products (•, •) 7 -, 
however, are evaluated as vector inner products (or a 
matrix-matrix multiplication when several integrals are 
evaluated simultaneously). A block row partition is more 
efficient for these types of operations. Collective commu¬ 
nication is required to sum up local products. To accom¬ 
modate both data distribution schemes in this step, we 
use the MPLAlltoallv call within a global column proces¬ 
sor group to convert and from column partition 
to row partition. 

More sophisticated data communication is needed for 
step 2. This is because the projection vector of a nonlo¬ 
cal pseudopotential may be distributed among several el¬ 
ements (up to 8, see Fig.[^(b)). If the nonzero support of 
the projection operator is completely within one element 
EK^ this projection operator only contributes to one di¬ 
agonal block Otherwise it contributes also to some 

off-diagonal matrix blocks for all Ek'^s that con¬ 

tain a distributed portion of the projection vector. Unlike 
the local pseudopotential Ves, the nonlocal pseudopoten¬ 
tial does not need to be updated during the SCF itera¬ 
tion. Therefore, it is efficient for processors associated 
with the element Ej^ to own a distributed portion of the 
projection vector bj^i{x) on the LGL grid constructed 
on Ek, if bi^i{x) does not vanish on Ex- This allows 
the inner product {bi^i,(fx,j)j- to be entirely evaluated 
on processors associated with Ex without further inter¬ 
element communication. The computation of the matrix 

element {‘Pk', j',bi,e)-j-{bi^£,ipx,j)j-'j however, 

requires data communication between processors associ¬ 
ated with Ex and Ex', on which bj^e does not vanish. 
This is done by communicating the inner products of 
the form (&/,f, fK,j)-j- among such neighboring elements. 
Since the size of such inner products is independent of 
the LGL grid size, the communication volume of this 
step is relatively low. For inter-element parallelization, 
we use asynchronous data communication routines with 
MPI Jsend and MPI Jrecv for efficient data communica¬ 
tion. 

The boundary integrals that appear in step 3 also re¬ 
quire inter-element communication. However, the inter¬ 
element data communication only occurs among two 
neighboring elements Ex, Ex' if the two elements share 
a common surface S. It should be noted that the compu¬ 
tation of average and jump operators only requires val¬ 
ues of functions on the surface S, and therefore only the 
function values of ^x and ^x' together with their gra¬ 
dients and restricted to S need to be com¬ 

puted. These calculations require much lower commu¬ 


nication volume compared to those required in volume 
integrals. We also remark that the communication per¬ 
formed in this step can be overlapped with computation 
to further reduce the cost of communication. In partic¬ 
ular, we first launch the communication needed to carry 
out step 3 before starting to perform the computational 
tasks in step 1, which does not require inter-element data 
communication. 

Numerical results indicate that the overall procedure 
for constructing the DG Hamiltonian matrix is highly 
efficient. For instance, for the phosphorene P 3500 sys¬ 
tem, the total wall clock time used to construct the DG 
Hamiltonian matrix is only 1.11 and 0.84 sec when the 
construction is carried out on 12,800 and 25,600 proces¬ 
sors, respectively. 

Computation of the electron density 

As discussed in section [H^ the electron density can 
be assembled from the eigenvector coefficients {ci;x,j}, 
or from the diagonal matrix blocks of the density matrix 
Pk-,k directly. Both options are supported in DGDFT. 
When the eigenvector option is used, we diagonalize 
the DG Hamiltonian matrix (referred to as the ”DIAG” 
method) by using the ScaLAPACK software package. [35] 
In order to use ScaLAPAGK, we need to convert the 
block row partition of the DG Hamiltonian matrix (see 
Fig. 0 to the 2D block cyclic data distribution scheme re¬ 
quired by ScaLAPACK. The eigenvectors returned from 
ScaLAPACK, which are stored in the 2D block cyclic 
format, are redistributed according to the block row par¬ 
tition used to distribute . We developed a routine 
to perform these conversions. Such conversion is the only 
operation in DGDFT that involves MPI communication 
among, in principle, all the available processors. All 
other MPI communication is performed either within the 
global row processor group or the global column proces¬ 
sor group. We note that a similar conversion procedure 
also can be found in other electronic structure software 
packages such as SIESTA|6] and CP2K[T6] when ScaLA¬ 
PACK is used. 

Once the eigenvectors are redistributed by elements, 
the electron density can be evaluated locally on each ele¬ 
ment. The global electron density is imply the collection 
of the density defined on each local element. Such global 
electron density is never collected to be stored a single 
processor, but is distributed across processors in a global 
row processor group. 

It should be noted that for large systems, all ScaLA¬ 
PACK diagonalization routines (such as the divided and 
conquer routine PDSYEVD currently used in DGDET) 
have limited parallel scalability. They often cannot make 
efficient use of the maximum number of processors (which 
can be 10, 000 ^ 100, 000 or more) that can be used by 
other parts of the DGDFT calculation. Therefore, we 
often need to restrict the ScaLAPACK calculation to a 
subset of processors to avoid getting sub-optimal perfor¬ 
mance. 
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C. Pole expansion and selected inversion method 

When the electron density is computed from the ex¬ 
pression given in (§, we use the recently developed pole 
expansion and selected inversion (PEXSI) method P71- 
m] to compute the diagonal blocks of the density ma¬ 
trix. This technique avoids the diagonalization proce¬ 
dure which has an 0{N^) complexity. It is accurate 
for both insulating and metallic systems. Furthermore, 
the computational complexity of the PEXSI method is 
only 0{N) for ID systems, for 2D systems, 

and 0{N'^) for 3D systems. Therefore, the PEXSI 
method is particularly well suited for studying electronic 
structures of larges scale low-dimensional (ID and 2D) 
systems. [36l [37] 

The PEXSI method is based on approximating the 
density matrix by a linear combination of Green’s func¬ 
tions, i.e., 

Q 

, ( 11 ) 

1=1 

where the integration weights loi and shifts zi are cho¬ 
sen carefully so that the number of expansion terms Q is 
proportional to log(/3Ai?), where /3 is the inverse temper¬ 
ature and AE is the spectrum width of . In practi¬ 
cal calculations, we observe that it is often sufficient to 
choose AE to be much smaller than the true spectrum 
width of thanks to the exponential decay of the 

Fermi-Dirac function above the Fermi energy. In most 
cases, it is sufficient to choose Q = 40 ~ 80. If we only 
need the diagonal blocks of the density matrix P, we do 
not need to compute the entire inverse of — zil. 
Only the diagonal blocks of — zil)~^ need to be 

computed, and these diagonal blocks can be computed ef¬ 
ficiently by using the selected inversion method. [27] The 
use of selected inversion leads to favorable complexity of 
the PEXSI method. 

The PEXSI method can scale to 10,000 to 100,000 pro¬ 
cessors. This has recently been demonstrated in the mas¬ 
sively parallel SIESTA-PEXSI method ]^ 136] . SIESTA- 
PEXSI uses local atomic orbitals to discretize the Kohn- 
Sham Hamiltonian. Because DGDFT is designed to take 
advantage of massively parallel computers, the high seal- 
ability of PEXSI, in addition to its lower asymptotic 
complexity, makes it a more attractive option compared 
to the diagonalization option. When the PEXSI option 
is used, DGDFT can often scale to 10,000 ~ 100,000 
processors, and be used to solve the electronic structure 
problems with more than 10, 000 atoms efficiently. m 

In order to use PEXSI in DGDFT, the DG Hamilto¬ 
nian matrix needs to be redistributed from a block row 
distribution format to a distributed compressed sparse 
column format (DCSG). When the DGSG format is used 
to distribute a sparse Hamiltonian matrix among M pro¬ 
cessors, each processor holds roughly [n/MJ columns 
stored in a compressed sparse column (CSC) format, 
where n is the dimension of the matrix. The diagonal 


blocks of the density matrix returned from PEXSI, which 
are stored in DCSG format, are converted back to the 
block row partition format used to represent the electron 
density. 

Because the selected inversions of the shifted DG 
Hamiltonians — zil) are independent of each other 

for different poles zj, we can carry out these selected 
inversions among different global row processor groups. 
Therefore, unlike the data redistribution scheme used for 
ScaLAPACK. As a result, the procedure for redistribut¬ 
ing required by PEXSI uses asynchronous commu¬ 
nication only within a global row processor group. 

Because each parallel selected inversion requires pro¬ 
cessors to be arranged logically into a 2D grid of size 
(#PEXSIProcRow)x(#PEXSIProcCol), the processors 
within each global row processor group is reconfigured in 
PEXSI as shown in Fig. Note that restricting the se¬ 
lected inversion at each pole to processors belonging to a 
global processor row group limits the maximum number 
of processors that can be used for each selected inver¬ 
sion to the number of elements M. However, for systems 
of large size, M can be easily 1000 or more. Therefore 
such a configuration does not severely limit the number 
of processors that can be effectively used for selected in¬ 
version. If the number of processors (Pe) within each 
global column processors group is larger than the num¬ 
ber of poles Q, all selected inversions required in the 
pole expansion 0 can be carried out simultaneously. 
The number of processors that can effectively be used in 
PEXSI is MQ. If Q > Pe, we have to process at most 
Pe poles at a time, and repeat the process until selected 
blocks of {H — zil)~^ have been computed for all zi’s. 
In this case and when all M processors in a global pro¬ 
cessor row group are used for selected inversion, all MPg 
processors can be effectively used by PEXSI. 


III. COMPUTATIONAL RESULTS 

In this section, we report the performance and accu¬ 
racy of DGDFT when it is applied to 2D phosphorene 
systems of different sizes. 

Phosphorene, a new two dimensional (2D) elemental 
monolayer. [381 - 142] has received considerable amount of 
interest recently after it has been experimentally isolated 
through mechanical exfoliation from bulk black phospho¬ 
rus. Phosphorene exhibits some remarkable electronic 
properties superior to graphene, a well known elemen¬ 
tal sp^-hybridized carbon monolayer. [43IH5] For exam¬ 
ple, phosphorene is a direct semiconductor with a high 
hole mobility. [38] It has the drain current modulation up 
to 10® in nanoelectronics. [39] These remarkable proper¬ 
ties have already been used for wide applications in field 
effect transistors [101 and thin-film solar cells. m Further¬ 
more, up to now, phosphorene is the only stable elemen¬ 
tal 2D material which can be mechanically exfoliated in 
experiments |35| besides graphene. Therefore, it can po¬ 
tentially be used as an alternative to graphene in the 
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FIG. 6: (Color online) The processor grid used in the PEXSI 
method. The processors used in PEXSI is usually a subset 
of all available processors. ^(^Pole represents the number of 
poles used in the PEXSI method. The inset shows that each 
ID global row processor group is further partitioned into a 
2D processor group used for each selected inversion. 


future and lead to faster semiconductor electronics. 

Fig.i shows the atomic configuration of 2D phospho- 
rene monolayer in a 5 x 7 supercell (Pi 4 o). Other phos- 
phorene models in very large supercells involving thou¬ 
sands or tens of thousands of atoms, such as the 25 x 
35 (P 3500 ) and 50 x 70 (P 14000 ) supercells, which we use 
as test problems in this work, are not shown here. The 
vacuum space in the X and Y directions is about 10 A 
to separate the interactions between neighboring slabs in 
phosphorene. 

All calculations are performed on the Edison plat¬ 
form available at the National Energy Research Scien¬ 
tific Computing (NERSC) center. Edison has 5462 Cray 
XC30 nodes. Each node has 24 cores partitioned among 
two Intel Ivy Bridge processors. Each 12-core processor 
runs at 2.4GHz, and has 64 GB of memory per node. 
The maximum number of available cores is 131,088 on 
Edison. In all calculations, we utilize all 24 cores on a 
computational node. 


A. Computational accuracy 

We use the conventional plane wave software pack¬ 
age ABINIT[T3] as a reference to check the accu¬ 
racy of results from DGDET. The same exchange- 
correlation functional of the local density approximation 
of Goedecker, Teter, Butter (LDA-Teter93) [50] and the 
Hartwigsen-Goedecker-Hutter (HGH) norm-conserving 


pseudopotential |57j are adopted in both ABINIT and 
DGDET software packages. 

We first check the accuracy of the total energy and 
the atomic force of the DGDET method by using P 140 
shown in Fig. as an example. To simplify our dis¬ 
cussion, we define the total energy error per atom AE 
(Hartree/atom) and maximum atomic force error AF 
(Hartree/Bohr) as 

AE = (aDGDft _ £;AbiniT)/^^^ 

and 

AF = max _ ^abiniT|_ 

respectively, where Na is the total number of atoms. 
FDGDft ^ABINIT represent the total energy com¬ 
puted by DGDET and ABINIT respectively, and 
^DGDFT ^ABINIT j-gpresent the Hellmann-Feynman 
force on the I-th phosphorus atom in P 140 computed by 
DGDET and ABINIT, respectively. The ABINIT results 
are obtained by setting the energy cutoff to 200 Hartree 
for the wavefunction to ensure full convergence. The ki¬ 
netic energy cutoff (denoted by Ecut) in the DGDET 
method is used to defined the grid size for computing the 
ALBs as is in standard Kohn-Sham DFT calculations 
using planewave basis sets. Ecut is also directly related 
to the Legendre-Gauss-Lobatto (LGL) integration grid 
defined on each element and used to perform numerical 
integration as needed to construct the DG Hamiltonian 
matrix. The number of LGL grids per direction is set 
to be twice the number of grid points calculated using 
Eq. ( [To] ) with the same Ecut. 

Table |l] shows that the total energy and atomic forces 
produced by the DGDET method are highly accurate 
compared to the ABINIT results. In particular, the to¬ 
tal energy error AE can be as small as 3.39 x 10“® 
Hartree/atom if the DIAG method is used and 8.12x 10“® 
Hartree/atom if the PEXSI method is used respectively. 
Here 50 poles are used and the accuracy of PEXSI can be 
further improved by increasing the number of poles.The 
maximum error of the atomic force can be as small 
as 1.06 X 10“^ Hartree/Bohr when DIAG is used and 
1.06 X 10“"^ Hartree/Bohr when PEXSI is used. These 
results are obtained when only a relatively small num¬ 
ber of ALB functions per atom are used to construct the 
global DG Hamiltonian. The energy cutoff for construct¬ 
ing the ALBs is set to 200 Hartree in this case. Note that 
the accuracy of total energy and atomic force in DGDET 
depends on both the energy cutoff for local wavefunctions 
defined on an extended element and the number of ALB 
functions. We can see from Table jl] that the accuracy of 
the total energy and atomic forces both improve as the 
energy cutoff and the number of ALB functions increases. 
We also find that the use of the Hellmann-Feynman force 
can result in accurate force calculation, despite that the 
contribution from the Pulay force[48] is not included. 

In the following parallel efficiency tests, we set the en¬ 
ergy cutoff to 40 Hartree for ALBs and 36.57 ALB func¬ 
tions per atom (80 ALB functions per element), which 
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TABLE I: The accuracy of DGDFT in terms of the total 
energy error per atom AE (Hartree/atom) and the maxi¬ 
mum atomic force error AF (Hartree/Bohr) in the DIAG 
and PEXSI methods with different kinetic energy cutoff Ecut 
(Hartree), and number of ALB functions per atom, compared 
with converged ABINIT calculations. ALB means the num¬ 
ber of ALB functions per atom. 


DGDFT Pi4o 

DIAG 

PEXSI 

Ecut 

#ALB 

AE 

AF 

AE 

AF 

20 

91.43 

5.22E-04 

4.03E-03 

3.22E-04 

4.03E-03 

40 

18.28 

4.51E-02 

5.97E-02 

4.57E-02 

5.97E-02 

40 

27.42 

6.67E-04 

2.51E-03 

6.85E-04 

2.52E-03 

40 

36.57 

1.34E-04 

6.16E-04 

1.59E-04 

6.18E-04 

40 

45.71 

7.00E-05 

4.00E-04 

6.44E-05 

5.23E-04 

40 

91.43 

-4.32E-07 

5.93E-04 

1.34E-04 

5.93E-04 

100 

91.43 

1.59E-05 

1.97E-04 

8.04E-05 

1.97E-04 

200 

91.43 

3.39E-06 

1.06E-04 

8.12E-05 

1.06E-04 


achieves good compromise between accuracy and compu¬ 
tational efficiency. For this particular choice of the energy 
cutoff and the number of ALB functions, we are able to 
keep the total energy error under 1 x 10“'* Hartree/atom 
and atomic force error under 1 x 10“^ Hartree/Bohr for 
2D phosphorene systems. 


B. Parallel efficiency 

To illustrate the parallel scalability of the DGDFT 
method, we demonstrate the performance of three main 
steps in each SCF iteration as shown in Fig. (a) the 
generation of ALB functions, (b) the construction of DG 
Hamiltonian matrix and (c) the evaluation of the approx¬ 
imate charge density, energy and atomic forces by either 
diagonalizing the DG Hamiltonian (DIAG) or by using 
the PEXSI technique. Note that there are some addi¬ 
tional steps such as the computation of energy, charge 
mixing or potential mixing, and intermediate data com¬ 
munication etc. The cost of these extra steps is included 
in the total wall clock time. 

Fig.0 shows the strong parallel scaling of these three 
individual steps, as well as the overall DGDFT method, 
for two large-scale 2D phosphorene monolayers (P 3500 
and P 14000 ) in terms of the wall clock time per SGF 
step. For P 3500 , we tested the performance using both the 
DIAG and PEXSI methods for the evaluation of charge 
density during SGF. But for P 14000 ; we only use the 
PEXSI method, since the DIAG method is too expen¬ 
sive for systems of such size (the dimension of the matrix 
is 512,000). 

The wall clock time of the first two steps are inde¬ 
pendent of whether PEXSI or DIAG is used. Eig. [^a) 
and (b) show that they both scale well with respect to 
the number of cores used in the computation for all test 
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FIG. 7: (Color online) The wall clock time with respect to the 
number of cores used for the computation for 2D phospho¬ 
rene monolayer systems of different sizes (P3500 and Pi4ooo). 
Strong scaling of (a) the generation of ALB functions step, (b) 
the DG Hamiltonian matrix construction step, (c) the eval¬ 
uation of the approximate charge density, energy and forces 
from the constructed DG Hamiltonian matrix, (d) the overall 
computational time. The reported wall clock time is for one 
SCF iteration. The timing and scaling shown in (c) and (d) 
depend on whether DIAG (hollow markers) or PEXSI (solid 
markers) is used to evaluate physical quantities such as charge 
density, energy and forces. 


problems we used. The exception is the construction 
of the DG Hamiltonian which does not scale beyond 
10^ processors for Pi 4 ooo- We find that the only rou¬ 
tine does not scale well is the inter-element communi¬ 
cation of the boundary values of and which 

is currently implemented via asynchronous communica¬ 
tion. Since the volume of the asynchronous communica¬ 
tion is proportional to the system size, for a large system 
the communication volume may exceed the size of the 
MPI buffer which leads to sub-optimal performance of 
the asynchronous data communication. Nonetheless, the 
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cost of the generation of ALBs and the construction of 
the DG matrix is much less compared to that for com¬ 
puting the electron density from the DG Hamiltonian. 

Fig. I^c) and (d) show that the evaluation of the ap¬ 
proximate charge density using the DG Hamiltonian ma¬ 
trix dominates the total wall clock time per SGF iter¬ 
ation in the DGDFT methodology for systems of large 
sizes. For large-scale 2D phosphorene systems P3500 and 
Pi4800, the PEXSI method can effectively reduce the wall 
clock time compared to the DIAG method in the DGDFT 
methodology. Furthermore, using the DIAG method 
with ScaLAPACK. [55] appears to limit the strong paral¬ 
lel scalability to at most 10,000 cores on the Edison. In¬ 
creasing the cores beyond that can lead to an increase in 
wall clock time. In contrast, the PEXSI method exhibits 
highly scalable performance. It can make efficient use of 
about 100,000 cores on Edison for P3500 and Pi4ooo- 

Fig-i shows the speedup and parallel efficiency with 
respect to the number of cores used for the computation 
for 2D phosphorene P3500 and Pi 4 ooo- For the P3500 sys¬ 
tem, both the DIAG and PEXSI methods in DGDFT 
can keep highly parallel efficiency (90% for DIAG and 
80% for PEXSI) with less than 10,000 cores. But for the 
DIAG method, further increase of the number of cores 
will lead to rapid loss of parallel efficiency. On the con¬ 
trary, the PEXSI method is highly scalable, and its par¬ 
allel efficiency is about 80% even when the number of 
cores increases beyond 80,000 for the P3500 system. For 
the large-scale P14000 system, we only use the PEXSI 
method in DGDET, and we find that the parallel effi¬ 
ciency of the DGDET-PEXSI method around 80% when 
128,000 cores are used on Edison (Edison has 131,088 
cores in total). 



FIG. 8: (Color online) The (a) speedup and (b) parallel effi¬ 
ciency with respect to the number of cores used for the com¬ 
putation for 2D phosphorene monolayer systems of different 
sizes (P3500 and Pmooo)- 


IV. CONCLUSIONS 


We described a massively parallel implementation of 
the DGDFT (Discontinuous Galerkin Density Functional 
Theory) methodology that can be used to perform large- 
scale Kohn-Sham density functional theory (DFT) calcu¬ 
lations efficiently. We demonstrated the accuracy and ef¬ 
ficiency of our parallel implementation. In particular, we 
showed that DGDFT can achieve accuracy comparable to 
that produced by a conventional planewave based calcu¬ 
lation with far fewer number of basis functions. We also 
showed that DGDFT can efficiently use 128,000 compu¬ 
tational cores to solve a problem with over 10, 000 atoms. 
The high parallel efficiency results from a two-level paral¬ 
lelization schemes that make use of several different types 
of data distribution, task scheduling and data commu¬ 
nication schemes. It also benefits from the use of the 
PEXSI method to compute electron density, energy and 
atomic forces. The PEXSI method has a favorable com¬ 
putational complexity and is also amenable to a two-level 
parallelization scheme that enables it to achieve high par¬ 
allel efficiency. 
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